Introduction to rDolphin

Daniel Canueto

rDolphin is an R package that performs the automatic profiling of 1D 1H NMR spectra datasets and outputs several indicators of the quality of quantification and identification for each signal area quantification. To perform a reliable automatic profiling, resilient to the multiple factors that can introduce variability into a dataset, rDolphin splits the spectrum into Regions of Interest (ROIs). Each ROI has a specific method of quantification (integration or deconvolution; with or without baseline fitting), a list of signals to quantify and a list of signal parameters to monitor (e.g. the chemical shift variability).

In addition, rDolphin comes with tools to load individual quantifications so they can be evaluated and updated, and to perform different kinds of exploratory analyses. For researchers with limited programming skills, the package incorporates a Shiny GUI.

Please run these commands in the R console to install the package:

  1. install.packages("devtools") #installation of package necessary to install rDolphin from Github
  2. devtools::install_github("danielcanueto/rDolphin", build_vignettes = TRUE) #installs rDolphin
  3. library(rDolphin) #loads rDolphin

If the installation fails at any point, the most likely reason is a missing dependency. Look in the console for “there is no package called” messages, install the required package with install.packages and run devtools::install_github("danielcanueto/rDolphin", build_vignettes = TRUE) again.

Introduction Structure

The introduction will show how to perform, in the R console, the profiling of a subset of 30 spectra of MTBLS242, a MetaboLights dataset of blood spectra. The 30 spectra belong to 15 patients with blood samples taken at two different times.

A link to an online document showing how to perform similar steps through the Shiny GUI is available in the Readme file of the GitHub repository: https://github.com/danielcanueto/rDolphin.

Import of necessary data

The extdata folder of the package contains the information necessary to load a dataset of 30 spectra of MTBLS242. Run this command to find the path where this information is located on your computer:

file.path(system.file(package = "rDolphin"),"extdata")

You will find the following files in the folder:

A thorough description of the structure of every file necessary to input to the tool is available here: https://docs.google.com/document/d/1t--gF5mCBNhbGvn53vKth2nlTzucLx55EfUWIgrMh_o/edit?usp=sharing

Run these commands to set the extdata folder as the working directory and to import the data contained there:

setwd(file.path(system.file(package = "rDolphin"),"extdata"))
imported_data=import_data("Parameters_MTBLS242_15spectra_5groups.csv")

An imported_data list is generated with several variables. Feel free to investigate the information present in each of them. Some important ones are dataset, ppm and ROI_data.
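For example, a quick way to inspect these elements from the R console (a sketch that assumes only the element names mentioned above):

```r
# Overview of the elements of the imported_data list
str(imported_data, max.level = 1)

# Spectra matrix: one row per experiment, one column per data point
dim(imported_data$dataset)

# ppm scale covered by the spectra
range(imported_data$ppm)

# First ROIs and their profiling parameters
head(imported_data$ROI_data)
```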

Exploratory analysis of spectra dataset

rDolphin eases the analysis of the dataset complexity through two kinds of interactive Plotly figures:

For each kind of figure, a red trace appears below the spectra. This red trace gives the results of a univariate analysis for every bin and helps the user detect interesting regions to profile. Feel free to play with the interactivity of the Plotly figure.

If you do not know how to annotate a signal in the dataset, you can evaluate possible options in the biofluid studied (blood), ranked by probability, thanks to the signal repository included with the package. For example, these commands:

ind=which(imported_data$repository$Shift..ppm.<3.37 & imported_data$repository$Shift..ppm.>3.35)
View(imported_data$repository[ind,])

help see which signals in the blood matrix appear in the 3.35-3.37 ppm region. A methanol singlet stands out in the results. Accordingly, a high-intensity singlet appears in our dataset. Filtering by biofluid avoids the wrong annotations that are typical when using general databases.

In addition, the user can also visualize the results of STOCSY or RANSY in a region of the dataset. These are the results achieved for a glutamine signal at 2.14-2.12 ppm:

identification_tool(imported_data$dataset,imported_data$ppm,c(2.14,2.12),method='spearman')

The other important glutamine signal at 2.45-2.4 ppm stands out.

Evaluation of ROI information

Looking at the performance of profiling in a model spectrum can help to improve the parameters in some ROIs and to check additional regions of the spectrum with the potential to give interesting insights to a study.

Run the next command:

profiling_model=profile_model_spectrum(imported_data,imported_data$ROI_data)

profiling_model has three elements:

Feel free to modify cells of the ROI information contained in imported_data$ROI_data, or to add or remove ROIs, and then repeat the model spectrum profiling process.

Automatic profiling of spectra

When you are satisfied with the ROI data contained in imported_data$ROI_data, you can perform an automatic profiling of the spectra:

profiling_data=automatic_profiling(imported_data,imported_data$final_output,imported_data$reproducibility_data,imported_data$ROI_data)

profiling_data has two variables:

To output CSV files with the final_output data on your computer, run this command:

write_info('output_info',profiling_data$final_output,imported_data$ROI_data)

Go to the output_info folder, stored in the extdata folder mentioned in Import of necessary data, and you will find the outputted information.
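As a quick sanity check of the export, the files can be listed and read back in (a sketch; the exact file names depend on the signals profiled):

```r
# List the exported CSV files and read the first one back in
out_files = list.files("output_info", full.names = TRUE)
print(out_files)
head(read.csv(out_files[1]))
```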

To output PDFs with plots of every quantification on your computer, run this command:

write_plots('',profiling_data$final_output,imported_data,profiling_data$reproducibility_data)

A plots folder is stored in the extdata folder mentioned in Import of necessary data. It contains a PDF file for every signal, with all its quantifications.

Validation of quantifications

The validation function allows the user to analyse possible wrong identifications or quantifications. For example, these commands:

validation_data=validation(profiling_data$final_output,profiling_data$alarm_matrix,3)
library(DT)
datatable(round(validation_data$shown_matrix,4),selection = list(mode = 'single', target = 'cell')) %>% formatStyle(colnames(validation_data$shown_matrix), backgroundColor = styleInterval(validation_data$brks, validation_data$clrs))

help show the difference between the expected chemical shift and the calculated chemical shift in every signal quantification. The datatable function of the DT package helps easily identify the cells with a darker red as the more suspicious quantifications.

If the user wants to load the information and the plot of one of these suspicious quantifications, they can do so through the load_quantification function. Run these commands to load the TSP quantification of the first experiment:

loaded_quantification=load_quantification(profiling_data$reproducibility_data,imported_data,profiling_data$final_output,list(row=1,col=1),imported_data$ROI_data)

The p element of loaded_quantification shows an interactive figure of the quantification so its quality can be checked. It can be observed that the fitted signal is narrower than the real one. This effect is caused by the binding of TSP to proteins: this binding creates excessive half-bandwidth variability in the TSP signal across a spectra dataset. The ROI_par element shows the ROI parameters used during this quantification.
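Both elements can be accessed directly in the console:

```r
loaded_quantification$p        # interactive Plotly figure of the fitted quantification
loaded_quantification$ROI_par  # ROI parameters used for this quantification
```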

As there is neither baseline nor any overlapping signal, it is better to change the quantification mode for the TSP signal to “Clean Sum”, so that integration is performed instead of fitting:

imported_data$ROI_data[1,3]="Clean Sum"

Then update the ROI profiling in all experiments of the dataset (inputting 1:nrow(imported_data$dataset)) through the individual_profiling function:

updated_profiling_data=individual_profiling(imported_data,imported_data$final_output,1:nrow(imported_data$dataset),imported_data$ROI_data[1,,drop=F],imported_data$reproducibility_data)
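To check the effect of the change, the original and updated TSP quantifications can be compared side by side (a sketch, assuming the quantification matrix keeps experiments in rows and signals in columns, with TSP in the first column):

```r
# Compare the first experiments before and after switching to "Clean Sum"
cbind(original = profiling_data$final_output$quantification[1:5, 1],
      updated  = updated_profiling_data$final_output$quantification[1:5, 1])
```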

Univariate and multivariate analysis of fingerprinting and profiling data

Univariate analyses of every bin could already be seen with the profile_model_spectrum function. The calculated p values can be outputted with the p_values function:

pval=p_values(imported_data$dataset,imported_data$Metadata)

However, the big advantage of profiling data over fingerprinting data is its resilience to overlapping and to chemical shift variability (typical in urine) or baseline distortion (typical in blood). Basic univariate analyses of profiling data can be evaluated by changing the data passed to p_values:

pval=p_values(profiling_data$final_output$quantification,imported_data$Metadata)
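With many signals tested, a multiple-testing correction of the outputted p values may be advisable. A minimal sketch with base R, assuming pval can be coerced to a numeric vector:

```r
# Benjamini-Hochberg adjustment of the p values
pval_adjusted = p.adjust(as.numeric(unlist(pval)), method = "BH")
sum(pval_adjusted < 0.05, na.rm = TRUE)  # signals still significant after correction
```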

Finally, dendrogram heatmaps can show us interesting subsets of samples or signals with the information provided. For example, an unidentified signal can be highly correlated in quantification with another signal because both are signals from the same metabolite:

type_analysis_plot(profiling_data$final_output$quantification,profiling_data$final_output,imported_data,'dendrogram_heatmap')

And now, what do I have to do?

After this introduction to the wide range of options to perform the automatic profiling of spectra datasets and to maximize the quality of this profiling, feel free to investigate the inputs and outputs of every function.

You should investigate how to adapt the ROI profiles to the matrix studied. We share on the rDolphin website the ROI information we have found optimal in several matrices. This ROI information will have to be adapted to the changes caused by the lab protocol during sample preparation or during spectrum acquisition and pre-processing.